Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT
Authors: Xiaocong Jia, Bruce Gu, Jinjun Chen, Longxiang Gao, Weiguang Pang, Guangtong Lv, Youyang Qu, Lei Cui. *Big Data Mining and Analytics*, 2025, No. 6, pp. 1307-1323 (17 pages).
Large Language Models (LLMs) are expanding their applications across various fields, including the Industrial Internet of Things (IIoT), where they analyze sensor data, automate diagnostics, and enhance predictive maintenance. LLM inference is provided by service providers to users, with each inference request undergoing two phases: prefill and decode. Due to the autoregressive nature of generation, only one token can be produced per iteration, necessitating multiple iterations to complete a request. Typically, batch processing groups multiple requests into a single batch for inference, improving throughput and hardware utilization. However, in service systems, a fixed batch size presents challenges under fluctuating request volumes, particularly in IIoT environments, where data flow can vary significantly. Specifically, during high-load periods, a fixed batch size may lead to underutilization of resources, while during low-load periods, it may result in resource wastage. In this paper, we introduce the FlexiDecode Scheduler (FDS) to address these challenges by dynamically adjusting the decoding batch size based on system load conditions, improving resource utilization, and reducing wait time during high-load periods. FDS prioritizes prefilling new requests to maximize decoding efficiency and employs a request output length predictor to optimize request scheduling, minimizing End-to-End (E2E) latency. Compared to virtual Large Language Model (vLLM) and Sarathi, our approach achieves a 23% and 16% reduction in E2E latency, improves actual request execution time by 34% and 15%, respectively, and increases computational utilization by 10%.
Keywords: virtual Large Language Model (vLLM) inference; batch scheduling; dynamic decoding batches; calculating utilization
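The abstract's core idea, growing or shrinking the decode batch with system load while admitting waiting requests in order of predicted output length, can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's actual implementation: the class `FlexiDecodeSketch`, its `min_batch`/`max_batch` parameters, and the one-token-per-step decode loop are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    predicted_len: int                          # predicted output length (shorter pops first)
    rid: int = field(compare=False)             # request id, excluded from heap ordering
    generated: int = field(compare=False, default=0)

class FlexiDecodeSketch:
    """Hypothetical sketch of a load-adaptive decode batch scheduler."""

    def __init__(self, min_batch=4, max_batch=64):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.waiting = []    # min-heap ordered by predicted output length
        self.decoding = []   # requests currently in the decode batch

    def submit(self, req):
        heapq.heappush(self.waiting, req)

    def batch_size(self):
        # Grow the decode batch under high load, shrink it when idle,
        # clamped to the configured [min_batch, max_batch] range.
        load = len(self.waiting) + len(self.decoding)
        return max(self.min_batch, min(self.max_batch, load))

    def step(self):
        # Prefill-first: admit waiting requests (shortest predicted
        # output first) until the load-adaptive batch size is reached.
        target = self.batch_size()
        while self.waiting and len(self.decoding) < target:
            self.decoding.append(heapq.heappop(self.waiting))
        # Decode one token per request in the batch (autoregressive step),
        # then retire requests that have reached their predicted length.
        finished = [r for r in self.decoding if (r.generated + 1) >= r.predicted_len]
        for r in self.decoding:
            r.generated += 1
        self.decoding = [r for r in self.decoding if r not in finished]
        return finished
```

In this sketch the output-length predictor is stubbed out as a precomputed `predicted_len` field; the paper's predictor and its interaction with E2E latency are not modeled here.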