SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The …
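To make the token-tree idea concrete, the following is a minimal Python sketch, not SpecInfer's implementation. It assumes that each node stores one token and that the path from the root to a node spells the candidate sequence that node represents. The names `TokenNode`, `insert`, `sequences`, `verify_greedy`, and the `next_token` callback (a stand-in for the LLM) are hypothetical and introduced only for illustration. The verification routine walks the tree one step at a time under a simple greedy-matching assumption and makes no attempt to reproduce how SpecInfer performs verification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class TokenNode:
    """One node of a token tree (hypothetical layout, assumed for illustration).

    The path of tokens from the root down to this node is the candidate
    token sequence that the node represents.
    """
    token: Optional[int]  # None marks the root, i.e. the empty candidate
    children: Dict[int, "TokenNode"] = field(default_factory=dict)

    def insert(self, tokens: Sequence[int]) -> None:
        """Merge one speculated sequence into the tree, sharing common prefixes."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenNode(tok))

    def sequences(self, prefix: Tuple[int, ...] = ()) -> List[Tuple[int, ...]]:
        """List the candidate sequence of every non-root node, depth first."""
        out: List[Tuple[int, ...]] = []
        for tok, child in self.children.items():
            seq = prefix + (tok,)
            out.append(seq)
            out.extend(child.sequences(seq))
        return out


def verify_greedy(
    root: TokenNode,
    prefix: Sequence[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept speculated tokens for as long as they match the LLM's own choice.

    `next_token` is a placeholder for the LLM: given the context so far, it
    returns the token the LLM would emit. Every returned token is one the LLM
    itself chose, so the output equals plain step-by-step decoding.
    """
    accepted: List[int] = []
    node = root
    while True:
        target = next_token(list(prefix) + accepted)
        accepted.append(target)  # always correct: it is the LLM's own token
        child = node.children.get(target)
        if child is None:  # no speculated branch matches, so stop here
            return accepted
        node = child


# Three speculated continuations merged into one tree with a shared prefix.
root = TokenNode(None)
for candidate in ([5, 6, 7], [5, 8], [9]):
    root.insert(candidate)
```

In this sketch, the candidates `[5, 6, 7]` and `[5, 8]` share the node for token `5`, so the tree holds five nodes below the root rather than six. The sharing of common prefixes is what distinguishes a token tree from a flat list of candidate sequences.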