ExpertFlow: Optimized Expert Activation and Token Allocation for
Efficient Mixture-of-Experts Inference
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language Models (LLMs), face significant deployment challenges during inference due to their high memory demands. Existing offloading techniques, which swap activated and idle experts between the GPU and CPU, often suffer from rigid expert caching …