SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve substantial speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run …