Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
Authors: Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 4, pp. 605-622 (18 pages).
Abstract: Transformer models have become a cornerstone of various natural language processing (NLP) tasks. However, the substantial computational overhead during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the controlling element on artificial intelligence (AI) accelerators. Our work is anchored by four key contributions. First, we conduct a comprehensive analysis of the overhead composition within the transformer inference process, identifying the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, implementing a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10000 of the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique using segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22128.31 ms to 1041.72 ms. Our contributions significantly enhance the optimization of transformer models, enabling more efficient and expedited inference on AI accelerators.
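The core idea behind the launch-count reduction can be illustrated with a back-of-the-envelope cost model: if every operator pays a fixed host-side launch cost, grouping many operators into a single host-device launch (so a device-side scheduler such as the MPE dispatches them) amortizes that cost. The sketch below is not from the paper; the overhead and kernel-time constants are assumed values chosen purely for illustration.

```python
# Illustrative cost model for host-device launch amortization.
# All timing constants are hypothetical, not measurements from the paper.

LAUNCH_OVERHEAD_US = 10.0  # assumed fixed host-side cost per launch
KERNEL_TIME_US = 2.0       # assumed device-side time per operator

def per_op_launch(num_ops: int) -> float:
    """One host-device launch per operator (PyTorch-GPU style baseline)."""
    return num_ops * (LAUNCH_OVERHEAD_US + KERNEL_TIME_US)

def batched_launch(num_ops: int, ops_per_launch: int) -> float:
    """Group operators into batches so the host pays the launch cost
    once per batch; device-side scheduling dispatches the rest."""
    launches = -(-num_ops // ops_per_launch)  # ceiling division
    return launches * LAUNCH_OVERHEAD_US + num_ops * KERNEL_TIME_US

ops = 10_000
print(per_op_launch(ops))            # 120000.0 (us)
print(batched_launch(ops, 10_000))   # 20010.0 (us): one launch total
```

With one launch covering all 10,000 operators (a 1/10000 launch-count reduction, mirroring the ratio reported in the abstract), the host-side overhead term collapses from 100,000 us to 10 us in this toy model, leaving device compute time dominant.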
Keywords: transformer inference optimization; three-tier scheduling; zero-copy memory management; fast model loading