Accelerating Inference in Large Language Models with a Unified Layer
Skipping Strategy
Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computation through elaborate heuristics or additional predictors. However, in the decoding process of existing approaches, different samples are assigned different computational budgets, which cannot guarantee a stable and precise acceleration effect. Furthermore, …