Funding: Supported by the National Science and Technology Major Project (No. 2022ZD0116800), the NSFC International Young Scientists Fund (No. 62350410478), the Taishan Scholars Program (Nos. TSQNZ20230621 and TSQN202211214), the Shandong Excellent Young Scientists Fund (Overseas) (No. 2023HWYQ-113), and the Shandong Provincial Natural Science Foundation (No. ZR20221150015).
Abstract: Large Language Models (LLMs) are expanding their applications across various fields, including the Industrial Internet of Things (IIoT), where they analyze sensor data, automate diagnostics, and enhance predictive maintenance. LLM inference is provided by service providers to users, with each inference request undergoing two phases: prefill and decode. Due to the autoregressive nature of generation, only one token can be produced per iteration, so multiple iterations are needed to complete a request. Typically, batch processing groups multiple requests into a single batch for inference, improving throughput and hardware utilization. However, in serving systems, a fixed batch size presents challenges under fluctuating request volumes, particularly in IIoT environments, where data flow can vary significantly. Specifically, during high-load periods a fixed batch size may lead to underutilization of resources, since the batch cannot grow to absorb the backlog, while during low-load periods it may result in resource wastage while waiting for the batch to fill. In this paper, we introduce the FlexiDecode Scheduler (FDS) to address these challenges by dynamically adjusting the decoding batch size based on system load, improving resource utilization and reducing wait time during high-load periods. FDS prioritizes prefilling new requests to maximize decoding efficiency and employs a request output length predictor to optimize request scheduling, minimizing End-to-End (E2E) latency. Compared to vLLM (virtual Large Language Model) and Sarathi, our approach achieves a 23% and 16% reduction in E2E latency, improves actual request execution time by 34% and 15%, respectively, and increases computational utilization by 10%.
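The abstract does not give FDS internals, but the core idea of load-adaptive decode batching with prefill-priority admission can be sketched as follows. All names, bounds, and the admission policy here are illustrative assumptions, not the paper's actual design:

```python
from collections import deque

def choose_batch_size(load, min_bs=1, max_bs=32):
    """Clamp the decode batch size to the current number of live requests.
    min_bs and max_bs are illustrative bounds, not values from the paper."""
    return max(min_bs, min(max_bs, load))

class ToyScheduler:
    """Toy load-adaptive scheduler: new requests are prefilled and admitted
    into the decode batch before each decode iteration (prefill-first)."""

    def __init__(self, max_bs=32):
        self.waiting = deque()   # requests not yet prefilled
        self.decoding = []       # requests currently in the decode phase
        self.max_bs = max_bs

    def submit(self, req_id):
        self.waiting.append(req_id)

    def step(self):
        # Re-pick the batch size from total live load on every iteration,
        # instead of keeping it fixed.
        target = choose_batch_size(len(self.waiting) + len(self.decoding),
                                   max_bs=self.max_bs)
        # Prefill-first: admit waiting requests up to the adaptive target.
        while self.waiting and len(self.decoding) < target:
            self.decoding.append(self.waiting.popleft())
        # One decode iteration: each admitted request emits one token here.
        return list(self.decoding)
```

Under this sketch, a burst of arrivals immediately widens the decode batch (up to `max_bs`), while an idle period shrinks it, which is the behavior the abstract attributes to FDS.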
Funding: This work was supported in part by the National Natural Science Foundation of China [72101188], the Shanghai Municipal Science and Technology Major Project [2021SHZDZX0100], and the Fundamental Research Funds for the Central Universities.
Abstract: The volume of instant delivery has witnessed significant growth in recent years. Given the involvement of numerous heterogeneous stakeholders, instant delivery operations are inherently characterized by dynamics and uncertainty. This study introduces two order dispatching strategies, namely task buffering and dynamic batching, as potential solutions to these challenges. The task buffering strategy optimizes the timing of order assignment to couriers, thereby mitigating demand uncertainty. The dynamic batching strategy, in turn, alleviates delivery pressure by assigning orders to couriers based on their residual capacity and extra delivery distances. To model the instant delivery problem and evaluate the performance of the order dispatching strategies, an Adaptive Agent-Based Order Dispatching (ABOD) approach is developed, which combines agent-based modelling, deep reinforcement learning, and the Kuhn-Munkres algorithm. ABOD effectively captures the system's uncertainty and heterogeneity, facilitating stakeholder learning in novel scenarios and enabling adaptive task buffering and dynamic batching decision-making. The efficacy of the ABOD approach is verified through both synthetic and real-world case studies. Experimental results demonstrate that implementing the ABOD approach can increase customer satisfaction by up to 275.42% while reducing delivery distance by 11.38% compared to baseline policies. Additionally, the ABOD approach adaptively adjusts buffering times to maintain high levels of customer satisfaction across various demand scenarios. As a result, this approach offers valuable support to logistics providers in making informed order dispatching decisions in instant delivery operations.
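The Kuhn-Munkres (Hungarian) algorithm mentioned above solves exactly the minimum-cost one-to-one matching of orders to couriers. As a sketch, the tiny instance below finds the same optimum by exhaustive search (the cost matrix is hypothetical, not from the paper; Kuhn-Munkres computes this in O(n³) for large instances):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment of orders (rows) to couriers
    (columns) by brute force; the Kuhn-Munkres algorithm returns the
    same optimum without enumerating all n! permutations."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, best_perm

# Hypothetical cost matrix: cost[i][j] is the extra delivery distance
# courier j would incur by taking order i.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, match = best_assignment(cost)  # total == 5, match == (1, 0, 2)
```

In the ABOD setting, a matrix like this would be rebuilt at each dispatching round from residual capacities and detour distances before solving the assignment.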
Abstract: In this paper, a fabrication scheduling problem concerning the production of components at a single manufacturing facility is studied, in which the manufactured components are subsequently assembled into a finite number of end products. Each product is assumed to comprise a component common to all jobs and a component unique to itself. Common operations are processed in batches, and each batch requires a setup time. A product is completed when both of its operations have been processed and are available. The optimality criterion considered is the minimization of weighted flow time. For this scheduling problem, the optimal schedules are described by a weighted shortest processing time first (WSPT) order, and two algorithms are constructed corresponding to batch availability and item availability, respectively.
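The WSPT rule referenced above sequences jobs in nondecreasing order of processing time divided by weight. The sketch below illustrates only this plain rule on a toy single-machine instance with hypothetical data; it does not reproduce the paper's two algorithms, which additionally handle batch setup times and common/unique components:

```python
def wspt_order(jobs):
    """Sequence jobs by the WSPT rule: nondecreasing p_j / w_j.
    jobs: list of (processing_time, weight) tuples."""
    return sorted(jobs, key=lambda j: j[0] / j[1])

def weighted_flow_time(sequence):
    """Sum of w_j * C_j, where C_j is the completion time of job j
    when the sequence is run back to back from time 0."""
    t, total = 0, 0
    for p, w in sequence:
        t += p            # job finishes at cumulative time t
        total += w * t
    return total

# Toy instance: ratios p/w are 3.0, 0.5, 1.0, so WSPT runs
# (1, 2) first, then (2, 2), then (3, 1).
jobs = [(3, 1), (1, 2), (2, 2)]
seq = wspt_order(jobs)
print(weighted_flow_time(seq))  # 14
```

For comparison, running the jobs in their original order gives a weighted flow time of 1*3 + 2*4 + 2*6 = 23, so the WSPT sequence is strictly better on this instance.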