SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer devices? …
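
For context, speculative decoding works as follows: a small draft model cheaply proposes several tokens, and the large target model verifies them, accepting or rejecting each one so that the output distribution still matches the target model exactly. Below is a minimal, self-contained sketch of the vanilla single-sequence variant that SpecExec generalizes; `toy_model`, `draft_p`, and `target_p` are hypothetical stand-ins for real LLMs, not the paper's code.

```python
# A toy sketch of vanilla speculative decoding (draft-then-verify).
# Each "model" maps a context to a probability distribution over a small vocab.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def toy_model(context, temperature):
    """Hypothetical stand-in for an LLM step: returns p(next_token | context)."""
    seed = abs(hash((tuple(context), temperature))) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

draft_p  = lambda ctx: toy_model(ctx, temperature=1.5)   # small, fast model
target_p = lambda ctx: toy_model(ctx, temperature=1.0)   # large, slow model

def speculative_step(context, k=4):
    """Draft k tokens, then accept/reject them against the target model."""
    drafted, q_probs = [], []
    ctx = list(context)
    for _ in range(k):                      # cheap autoregressive drafting
        q = draft_p(ctx)
        t = rng.choice(VOCAB, p=q)
        drafted.append(t)
        q_probs.append(q)
        ctx.append(t)
    accepted = []
    ctx = list(context)
    for t, q in zip(drafted, q_probs):      # one batched target pass in practice
        p = target_p(ctx)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)              # token accepted as-is
            ctx.append(t)
        else:                               # rejected: resample from residual
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted                 # stop at first rejection
    accepted.append(rng.choice(VOCAB, p=target_p(ctx)))  # free "bonus" token
    return accepted

print(speculative_step([1, 2, 3], k=4))
```

The key property is that the accept/reject rule preserves the target model's output distribution, so speedup comes for free: several tokens can be committed per expensive target-model pass, which is especially valuable when weights are offloaded to RAM and one pass over a batch costs about as much as one pass over a single token.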