Fast collaborative inference via distributed speculative decoding
Authors: Ce Zheng, Ke Zhang, Chen Sun, Wenqi Zhang, Qiong Liu, Angesom Ataklity Tesfay. Journal of Information and Intelligence, 2026, No. 1, pp. 67-85 (19 pages)
Speculative decoding accelerates Large Language Model (LLM) inference by allowing a lightweight draft model to predict multiple future tokens that are subsequently verified by a larger target model. In AI-native Radio Access Networks (AI-RAN), this mechanism naturally enables device-edge collaborative inference. However, existing distributed speculative decoding schemes incur significant uplink communication overhead, as they require transmitting full-vocabulary logits at every decoding step. To address this challenge, we propose a sparsify-then-sample strategy, termed Truncated Sparse Logits Transmission (TSLT), which transmits only the logits and indices of a truncated candidate set. We provide theoretical guarantees showing that TSLT preserves the acceptance rate of speculative decoding. The proposed framework is further extended to a multi-candidate setting, where multiple draft candidates per step increase the acceptance probability. Extensive experiments demonstrate that TSLT substantially reduces uplink communication while maintaining end-to-end inference latency and model quality, validating its effectiveness for scalable and communication-efficient distributed LLM inference in future AI-RAN systems.
Keywords: Collaborative inference; Speculative decoding; Truncated sampling; Multi-candidate; Token tree
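The sparsify-then-sample idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the truncation rule, so top-k truncation is assumed here, and every name (`truncate_top_k`, `sparse_draft_distribution`, `verify`) and every number is hypothetical. The device keeps only k logits with their indices, samples the draft token from the renormalised truncated distribution, and the edge applies the standard speculative-decoding accept/reject test against the target distribution.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def truncate_top_k(logits, k):
    """Device side (assumed rule): keep the k largest logits and their indices."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return order, [logits[i] for i in order]


def sparse_draft_distribution(indices, values):
    """Renormalise over the truncated set: {token index: probability}."""
    return dict(zip(indices, softmax(values)))


def verify(token, q_sparse, p_target, rng):
    """Edge side: accept with probability min(1, p/q), else resample."""
    q = q_sparse[token]
    p = p_target[token]
    if rng.random() < min(1.0, p / q):
        return token, True
    # On rejection, resample from the normalised residual max(p - q, 0);
    # tokens outside the truncated set have q = 0.
    residual = [max(p_i - q_sparse.get(i, 0.0), 0.0)
                for i, p_i in enumerate(p_target)]
    new_token = rng.choices(range(len(p_target)), weights=residual)[0]
    return new_token, False


# Toy 6-token vocabulary; all values are made up for illustration.
rng = random.Random(0)
draft_logits = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0]
target_probs = softmax([1.8, 0.2, 1.0, -0.5, 0.1, 2.5])

indices, values = truncate_top_k(draft_logits, k=3)   # the uplink payload
q_sparse = sparse_draft_distribution(indices, values)
draft_token = rng.choices(indices, weights=[q_sparse[i] for i in indices])[0]
token, accepted = verify(draft_token, q_sparse, target_probs, rng)
```

In this sketch the uplink carries k logit values and k indices per step instead of one logit per vocabulary entry, which is the source of the communication saving the abstract describes.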