Speculative decoding accelerates Large Language Model (LLM) inference by allowing a lightweight draft model to predict multiple future tokens that are subsequently verified by a larger target model. In AI-native Radio Access Networks (AI-RAN), this mechanism naturally enables device-edge collaborative inference. However, existing distributed speculative decoding schemes incur significant uplink communication overhead, as they require transmitting full-vocabulary logits at every decoding step. To address this challenge, we propose a sparsify-then-sample strategy, termed Truncated Sparse Logits Transmission (TSLT), which transmits only the logits and indices of a truncated candidate set. We provide theoretical guarantees showing that TSLT preserves the acceptance rate of speculative decoding. The proposed framework is further extended to a multi-candidate setting, where multiple draft candidates per step increase the acceptance probability. Extensive experiments demonstrate that TSLT substantially reduces uplink communication while maintaining end-to-end inference latency and model quality, validating its effectiveness for scalable and communication-efficient distributed LLM inference in future AI-RAN systems.
Funding: Supported by the National Key Research and Development Program of China (2024YFE0200800), in part by the Major Key Project of PCL (PCL2025AS209), and in part by the Guangdong S&T Programme (2024B0101010003).
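The sparsify-then-sample idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the candidate set is truncated, so top-k selection is assumed here, and the function names, vocabulary size, and byte widths (2-byte logits, 4-byte indices) are all hypothetical values chosen for illustration.

```python
import math
import random


def truncate_logits(logits, k):
    """Return (indices, values) of the k largest logits (assumed top-k truncation)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return top, [logits[i] for i in top]


def sparse_softmax(values):
    """Softmax over the kept logits only, i.e. the renormalized truncated distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def sparsify_then_sample(logits, k, rng):
    """Truncate first, then sample the draft token from the truncated distribution."""
    indices, values = truncate_logits(logits, k)
    probs = sparse_softmax(values)
    token = rng.choices(indices, weights=probs, k=1)[0]
    return token, indices, values


def payload_bytes(vocab_size, k, logit_bytes=2, index_bytes=4):
    """Per-step uplink size: dense logits vs. k (logit, index) pairs (assumed widths)."""
    return vocab_size * logit_bytes, k * (logit_bytes + index_bytes)


rng = random.Random(0)
vocab_size, k = 32000, 64  # hypothetical sizes
logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]

token, indices, values = sparsify_then_sample(logits, k, rng)
dense, sparse = payload_bytes(vocab_size, k)
print(f"sampled token {token}; dense payload {dense} B, sparse payload {sparse} B")
```

Because the draft token is drawn from the same truncated distribution that is sent uplink, the verifier can reconstruct the exact draft probabilities from the k transmitted pairs. Under the assumed sizes, the per-step payload drops from 64000 bytes to 384 bytes.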