Journal Articles
2 articles found
1. Convergence-aware operator-wise mixed-precision training
Authors: Wenhao Dai, Ziyi Jia, +1 more author, Yuesi Bai, Qingxiao Sun. CCF Transactions on High Performance Computing, 2025, Issue 1, pp. 43–57 (15 pages)
Abstract: With the support of more precision formats in emerging hardware architectures, mixed precision has become a popular approach to accelerating deep learning (DL) training. Applying low-precision formats such as FP16 and BF16 to neural operators can save GPU memory while improving bandwidth. However, DL frameworks use static black and white lists as their default mixed-precision selection and cannot flexibly adapt to a variety of neural networks. In addition, existing work on automatic precision adjustment does not consider model convergence, and the decision cost of precision selection is high. To address these problems, this paper proposes CoMP, a non-intrusive framework for Convergence-aware operator-wise Mixed-Precision training. CoMP uses a two-stage precision adjustment based on epochs and batches to ensure convergence and performance, respectively. CoMP then performs subsequent training according to the searched optimal operator-wise mixed-precision plan. Experimental results on an A100 GPU show that CoMP achieves a maximum speedup of 1.15× over the PyTorch AMP implementation while saving up to 29.81% of GPU memory.
Keywords: GPU, Mixed precision, Neural network training, Auto-tuning, Performance optimization
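The abstract contrasts static black/white-list precision selection (the framework default) with a searched per-operator plan. The sketch below illustrates that contrast in plain Python; the operator names, lists, and plan are illustrative assumptions, not CoMP's actual API or search result.

```python
# Hypothetical sketch: static list-based precision selection (as in
# default AMP) vs. a searched per-operator mixed-precision plan, in the
# spirit of the operator-wise approach described in the abstract.
# All names and the example plan are assumptions for illustration.

AMP_WHITELIST = {"matmul", "conv2d"}      # ops always cast to FP16
AMP_BLACKLIST = {"softmax", "layernorm"}  # ops always kept in FP32

def amp_precision(op: str) -> str:
    """Static list-based selection: one fixed rule for every network."""
    if op in AMP_WHITELIST:
        return "fp16"
    if op in AMP_BLACKLIST:
        return "fp32"
    return "fp32"  # conservative fallback for unlisted ops

def plan_precision(op: str, plan: dict) -> str:
    """Per-operator selection from a searched plan, which may override
    the static lists for a particular network."""
    return plan.get(op, "fp32")

# A searched plan could safely run softmax in FP16 for one model even
# though the static blacklist never would:
searched_plan = {"matmul": "fp16", "conv2d": "bf16", "softmax": "fp16"}
```

The point of the per-operator plan is flexibility: the same operator can receive different precisions in different networks, which a global black/white list cannot express.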
2. νGNN: Non-Uniformly partitioned full-graph GNN training on mixed GPUs
Authors: Hemeng Wang, Wenqing Lin, +1 more author, Qingxiao Sun, Weifeng Liu. CCF Transactions on High Performance Computing, 2025, Issue 4, pp. 305–322 (18 pages)
Abstract: Graph neural networks (GNNs) are well suited to GPUs with high computing capability due to their massive arithmetic operations. Compared with mini-batch training, full-graph training does not require sampling of the input graph and halo region, avoiding potential accuracy losses. Current deep learning frameworks evenly partition large graphs to scale GNN training to distributed multi-GPU platforms. Meanwhile, the rapid evolution of hardware requires technology companies and research institutions to frequently update their equipment for the latest tasks. This results in large-scale clusters that mix GPUs with varying computational capabilities and hardware specifications. However, existing works fail to consider sub-graphs adapted to different GPU generations, leading to inefficient resource utilization and degraded training efficiency. Therefore, we propose νGNN, a Non-Uniformly partitioned full-graph GNN training framework for heterogeneous distributed platforms. νGNN first models the GNN processing ability of the hardware based on various theoretical parameters. Then, νGNN automatically derives a reasonable task partitioning scheme by combining hardware, model, and graph dataset information. Finally, νGNN implements an irregular graph partitioning mechanism that allows GNN training tasks to execute efficiently on distributed heterogeneous systems. Experimental results show that in real-world scenarios with a mixture of GPU generations, νGNN outperforms static partitioning schemes based on hardware specifications.
Keywords: Graph neural network, Distributed training, Graph partitioning, GPU
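The core idea in the abstract above is sizing each GPU's sub-graph to its capability rather than splitting evenly. A minimal sketch of that idea, assuming a scalar capability score per GPU and largest-remainder rounding (both assumptions; the paper's actual capability model and partitioner are not described in the abstract):

```python
# Illustrative sketch of capability-proportional partition sizing, in the
# spirit of non-uniform partitioning for mixed-generation GPU clusters.
# The scalar capability score and rounding scheme are assumptions.

def partition_sizes(num_nodes: int, capabilities: list[float]) -> list[int]:
    """Split num_nodes across GPUs proportionally to capability scores."""
    total = sum(capabilities)
    shares = [num_nodes * c / total for c in capabilities]
    sizes = [int(s) for s in shares]  # floor of each proportional share
    # Hand leftover nodes to the GPUs with the largest fractional parts,
    # so the sizes always sum exactly to num_nodes.
    leftover = num_nodes - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# Example: 1000 nodes over one newer GPU rated twice as fast as each of
# two older ones.
print(partition_sizes(1000, [2.0, 1.0, 1.0]))  # → [500, 250, 250]
```

An even split would give each GPU 334/333/333 nodes and leave the faster device idle while the slower ones finish; the proportional split balances per-GPU work instead.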